In a method and an apparatus for identifying two nucleic acid base code sequences belonging to a given set of known base code
WO 96/12822 PCT/SE95/01213

METHOD FOR IDENTIFYING TWO NUCLEIC ACID BASE CODE SEQUENCES
TECHNICAL FIELD 

The invention relates to a method and an apparatus for
identifying two nucleic acid base code sequences belonging to
a given set of known base code sequences and being superposed
on each other in an original sequence which comprises base
codes as well as ambiguity codes.

BACKGROUND OF THE INVENTION 

Such a method is known from Erik H. Rozemuller et al.,
"Assignment of HLA-DPB alleles by computerized matching based
upon sequence data", Human Immunology 37, 207-212 (1993).

According to the known method, a database containing all
known HLA-DPB sequences makes it possible to analyze
heterozygous individuals by combinatorial comparison through all
base code sequences and thus identify the one or two base
code sequences involved. The HLA-DPB sequences in the
database are selected from published sequences (Marsh S.G.E.,
Bodmer J.G.; "HLA class II nucleotide sequences", 1992,
Tissue Antigens 40:229, 1992).

A disadvantage with the known method is its inability to 
handle artifacts in terms of inserted or removed base codes 
in a test sequence. 

Moreover, the known method is time consuming and involves 
a great amount of data. 

BROAD DESCRIPTION OF THE INVENTION 

The object of the invention is to bring about a method
which is less sensitive to artifacts in non-crucial parts of
a test sequence produced by sequencing equipment, when
analyzing low quality samples, the artifacts being described
in terms of inserted, removed and exchanged base codes and
ambiguity codes, and which is less time consuming and
involves less data than the known method, as well as an
apparatus for carrying that method into effect.

This is attained by a first embodiment of the method
according to the invention for identifying two nucleic acid
base code sequences belonging to a given set of known base
code sequences and being superposed on each other in an
original sequence which comprises base codes as well as
ambiguity codes, in that it comprises the steps of

a) constructing a master template sequence from said given
set of base code sequences by assigning every conserved
position, where the base code is the same all through the
set, that particular base code in said master template
sequence, and assigning every non-conserved position, where
the base code differs through the set, a wild-card code in
said master template sequence,

b) extracting from every base code sequence of said given
set, the non-conserved positions to obtain non-conserved
position subsequences containing only the non-conserved base
codes,

c) superposing in pairs all possible combinations of the
non-conserved position sequences extracted in step b) to obtain
combination sequences of base codes and ambiguity codes,

d) making a determination of the original sequence in order
to obtain a test sequence,

e) aligning said test sequence against said master template
sequence in such a manner that, accepting gaps in either
sequence, the matching between them is optimized, said
wild-card coded non-conserved positions in said master template
sequence being considered as matching any base code and any
ambiguity code in said test sequence,

f) extracting from said test sequence all base codes and
ambiguity codes which are aligned with the wild-card codes in
said master template sequence, and

g) comparing the base codes and ambiguity codes extracted in
step f) with all the combination sequences of base codes and
ambiguity codes obtained in step c), a match between one of
said combination sequences obtained in step c) and the base
codes and ambiguity codes extracted in step f) indicating
that that particular combination sequence of base codes and
ambiguity codes corresponds to said two nucleic acid base
code sequences to be identified.

This is also attained by a second embodiment of the
method according to the invention for identifying two nucleic
acid base code sequences belonging to a given set of known
base code sequences and being superposed on each other in an
original sequence which comprises base codes as well as
ambiguity codes, in that it comprises the steps of

a) constructing a master template sequence from said given
set of base code sequences by assigning every conserved
position, where the base code is the same all through the
set, that particular base code in said master template
sequence, and assigning every non-conserved position, where
the base code differs through the set, a wild-card code in
said master template sequence,

b) extracting from every base code sequence of said given
set, the non-conserved positions to obtain non-conserved
position subsequences containing only the non-conserved base
codes,

c) superposing, in pairs, all possible combinations of the
non-conserved position sequences extracted in step b) to
obtain combination sequences of base codes and ambiguity
codes,

d) making one or more determinations of the original sequence
in order to obtain one or more test sequences,

e) aligning each of said one or more test sequences against
said master template sequence in such a manner that,
accepting gaps in either sequence, the matching between the master
template and each test sequence is optimized, said wild-card
coded non-conserved positions in said master template
sequence being considered as matching any base code and any
ambiguity code in each test sequence,

f) extracting from each of said test sequences all base codes
and ambiguity codes which are aligned with the wild-card
codes in said master template sequence,

g) determining, for each non-conserved position, a consensus
base code or ambiguity code on the basis of the non-conserved
bases extracted from each test sequence by summing up a score
for each base code for each non-conserved position and
keeping the base code with the highest score, the score being
a function of the position of the base code in the respective
test sequence as well as of the local quality of the
alignment between the respective test sequence and said master
template sequence, and

h) comparing the consensus base codes and ambiguity codes
determined in step g) with all the combination sequences of
base codes and ambiguity codes obtained in step c), a match
between one of said combination sequences obtained in step c)
and the consensus base codes and ambiguity codes determined
in step g) indicating that that particular combination
sequence of base codes and ambiguity codes corresponds to
said two nucleic acid base code sequences to be identified.

A first embodiment of the apparatus according to the
invention for identifying two nucleic acid base code sequences
belonging to a given set of known base code sequences and
being superposed on each other in an original sequence which
comprises base codes as well as ambiguity codes, comprises
master template sequence constructing means for constructing
a master template sequence from said given set of base code
sequences by assigning every conserved position, where the
base code is the same all through the set, that particular
base code in said master template sequence, and assigning
every non-conserved position, where the base code differs
through the set, a wild-card code in said master template
sequence, non-conserved position extracting means for
extracting from every base code sequence of said given set, the
non-conserved positions to obtain non-conserved position
subsequences containing only the non-conserved base codes,
superposing means for superposing in pairs all possible
combinations of the non-conserved position sequences
extracted by said non-conserved position extracting means to obtain
combination sequences of base codes and ambiguity codes,
original sequence determining means for making a
determination of the original sequence in order to obtain a test
sequence, aligning means for aligning said test sequence
against said master template sequence in such a manner that,
accepting gaps in either sequence, the matching between them
is optimized, said wild-card coded non-conserved positions in
said master template sequence being considered as matching
any base code and any ambiguity code in said test sequence,
base code and ambiguity code extracting means for extracting
from said test sequence all base codes and ambiguity codes
which are aligned with the wild-card codes in said master
template sequence, and comparing means for comparing the base
codes and ambiguity codes extracted by said base code and
ambiguity code extracting means with all the combination
sequences of base codes and ambiguity codes obtained by means
of said superposing means, a match between one of said
combination sequences obtained by means of said superposing
means and the base codes and ambiguity codes extracted by
said base code and ambiguity code extracting means
indicating that that particular combination sequence of base codes
and ambiguity codes corresponds to said two nucleic acid base
code sequences to be identified.

A second embodiment of the apparatus according to the
invention for identifying two nucleic acid base code sequences
belonging to a given set of known base code sequences and
being superposed on each other in an original sequence which
comprises base codes as well as ambiguity codes, comprises
master template sequence constructing means for constructing
a master template sequence from said given set of base code
sequences by assigning every conserved position, where the
base code is the same all through the set, that particular
base code in said master template sequence, and assigning
every non-conserved position, where the base code differs
through the set, a wild-card code in said master template
sequence, non-conserved position extracting means for
extracting from every base code sequence of said given set, the
non-conserved positions to obtain non-conserved position
subsequences containing only the non-conserved base codes,
superposing means for superposing, in pairs, all possible
combinations of the non-conserved position sequences
extracted by said non-conserved position extracting means to obtain
combination sequences of base codes and ambiguity codes,
original sequence determining means for making one or more
determinations of the original sequence in order to obtain
one or more test sequences, aligning means for aligning each
of said one or more test sequences against said master
template sequence in such a manner that, accepting gaps in
either sequence, the matching between the master template and
each test sequence is optimized, said wild-card coded
non-conserved positions in said master template sequence being
considered as matching any base code and any ambiguity code
in each test sequence, base code and ambiguity code
extracting means for extracting from each of said test sequences
all base codes and ambiguity codes which are aligned with the
wild-card codes in said master template sequence, determining
means for determining, for each non-conserved position, a
consensus base code or ambiguity code on the basis of the
non-conserved bases extracted from each test sequence by
summing up a score for each base code for each non-conserved
position and keeping the base code with the highest score,
the score being a function of the position of the base code
in the respective test sequence as well as of the local
quality of the alignment between the respective test sequence
and said master template sequence, and comparing means for
comparing the consensus base codes and ambiguity codes
determined by said determining means with all the combination
sequences of base codes and ambiguity codes obtained by means
of said superposing means, a match between one of said
combination sequences obtained by means of said superposing
means and the consensus base codes and ambiguity codes
determined by said determining means indicating that that
particular combination sequence of base codes and ambiguity
codes corresponds to said two nucleic acid base code
sequences to be identified.

DESCRIPTION OF PREFERRED EMBODIMENTS 

In the following description, A, C, G and T stand for
adenine, cytosine, guanine and thymine, respectively, while
other one-letter codes stand for combinations of nucleotides
at the same position as defined by the Nomenclature Committee
of the International Union of Biochemistry (NC-IUB):
Nomenclature for incompletely specified bases in nucleic acid
sequences. Eur J Biochem 150:1, 1985 as follows:
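
The one-letter codes of that nomenclature can be written as a lookup table. The sketch below restates the standard NC-IUB assignments; the names `IUPAC` and `bases` are illustrative and do not come from the description.

```python
# NC-IUB one-letter codes mapped to the bases each code stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC",
    "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}


def bases(code: str) -> frozenset:
    """Return the set of bases represented by a one-letter code."""
    return frozenset(IUPAC[code])
```

Only the two-base codes (R, Y, M, K, S, W) can arise when exactly two unambiguous base code sequences are superposed.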
In the method according to the invention, one or more
determinations of an original sequence are made in order to
obtain one or more test sequences. The test sequences are
obtained in a manner known per se by means of sequencing
equipment, and are to be analyzed in order to identify the
two nucleic acid base code sequences which, superposed on
each other, make up the original sequence.

To accomplish this, the starting point is a given set of
alternative base code sequences (alleles) for a gene in the
HLA complex. For this example, the following set of three
alternative base code sequences or subtypes could be used:

Subtype 1   ACC GCT GAT CCC TGT CG
Subtype 2   --- --A TG- --- --C G-
Subtype 3   --- --- --- --- --C --

According to the nomenclature above, the first subtype is
explicitly defined, while merely deviations from the first
subtype are indicated for the other two subtypes.

It is to be understood that, in practice, the number of
subtypes is very large.

According to the invention, a master template sequence is
constructed from the above given set of subtypes by assigning
every conserved position, i.e. every position where the base
code is the same all through the set, that particular base
code in said master template sequence, while every
non-conserved position, i.e. every position where the base code
differs through the set, is assigned a wild-card code
corresponding to $ in said master template sequence.

Applying this to the above given set of just three base
code sequences, the master template sequence will be as
follows:

ACCGC$$$TCCCTG$$G.
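
The construction can be sketched in a few lines. This is an illustration only: it assumes the subtypes are already given as equal-length, position-aligned strings, with subtypes 2 and 3 written out in full from their deviations; the function name `master_template` is hypothetical.

```python
WILDCARD = "$"


def master_template(subtypes):
    """Keep a base where all subtypes agree; otherwise emit the wild-card."""
    return "".join(
        column[0] if len(set(column)) == 1 else WILDCARD
        for column in zip(*subtypes)
    )


subtypes = [
    "ACCGCTGATCCCTGTCG",  # subtype 1
    "ACCGCATGTCCCTGCGG",  # subtype 2, deviations applied to subtype 1
    "ACCGCTGATCCCTGCCG",  # subtype 3, deviations applied to subtype 1
]
print(master_template(subtypes))  # ACCGC$$$TCCCTG$$G
```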

According to the invention, the non-conserved positions
are also extracted from every base code sequence in the
above given set in order to obtain a corresponding set of
non-conserved position subsequences which only contain the
non-conserved base codes.

Applying this to the above given set of subtypes, the
following three non-conserved position subsequences are
obtained:

1. TGATC
2. ATGCG
3. TGACC.
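
As a sketch of this extraction step, again assuming equal-length, position-aligned subtype strings (the helper names are illustrative, not taken from the description):

```python
def non_conserved_positions(subtypes):
    """0-based indices of positions where the subtypes differ."""
    return [i for i, column in enumerate(zip(*subtypes)) if len(set(column)) > 1]


def non_conserved_subsequences(subtypes):
    """Reduce every subtype to its bases at the non-conserved positions."""
    positions = non_conserved_positions(subtypes)
    return ["".join(seq[i] for i in positions) for seq in subtypes]


subtypes = [
    "ACCGCTGATCCCTGTCG",  # subtype 1
    "ACCGCATGTCCCTGCGG",  # subtype 2
    "ACCGCTGATCCCTGCCG",  # subtype 3
]
print(non_conserved_subsequences(subtypes))  # ['TGATC', 'ATGCG', 'TGACC']
```

Working with these short subsequences rather than the full sequences is what keeps the later pairwise comparison small.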

According to the invention, all possible combinations of
the above non-conserved position sequences are superposed in
pairs in order to obtain combination sequences of base codes
and ambiguity codes.

For the above three non-conserved position sequences, the
following combination sequences are obtained:

Combination

1/1 TGATC
1/2 WKRYS
1/3 TGAYC
2/2 ATGCG
2/3 WKRCS
3/3 TGACC
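
The pairwise superposition can be sketched as follows. The two-base codes are the standard NC-IUB ones; the function names and the choice to store a list of pairs per combination (so that two different pairs producing the same combination would both be reported) are assumptions made for this illustration.

```python
from itertools import combinations_with_replacement

# Two-base NC-IUB codes keyed by the unordered pair of bases they represent.
PAIR_CODE = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
    frozenset("GT"): "K", frozenset("CG"): "S", frozenset("AT"): "W",
}


def superpose(seq_a, seq_b):
    """Merge two equal-length base sequences position by position."""
    return "".join(
        a if a == b else PAIR_CODE[frozenset((a, b))]
        for a, b in zip(seq_a, seq_b)
    )


def combination_table(subsequences):
    """Map each superposed sequence to the 1-based subtype pairs producing it."""
    table = {}
    indexed = list(enumerate(subsequences, start=1))
    for (i, a), (j, b) in combinations_with_replacement(indexed, 2):
        table.setdefault(superpose(a, b), []).append((i, j))
    return table


table = combination_table(["TGATC", "ATGCG", "TGACC"])
print(table["WKRCS"])  # [(2, 3)]
```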
In accordance with the invention, a test sequence,
obtained as indicated above, is then aligned with the master
template sequence in such a manner that, accepting gaps in
either sequence, the matching between the test sequence and
the master template sequence is optimized.

For this alignment, a dynamic programming algorithm
described by Saul B. Needleman and Christian D. Wunsch,
J. Mol. Biol. 48, 443 (1970), may be used.

This algorithm assigns points to every possible alignment
between the two sequences. Different points are awarded
e.g. for matching positions, mismatching positions, and
inserted or removed characters. The alignment that obtains
the highest number of points is kept.

According to the invention, the wild-card code also gives
matching points in combination with any character in the
other sequence. Thus, the master template sequence will have
the function of pointing out non-conserved positions in the
test sequence based on the local appearance of the alignment
between the sequences. This will function despite different
forms of artifacts (inserted, removed and/or exchanged
characters) in the conserved regions and without actual
knowledge of where the respective test sequence starts.
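
A minimal sketch of such an alignment is given below. It is a Needleman-Wunsch style dynamic program in which the wild-card scores as a match against any character and in which unaligned test characters before and after the template are not penalized. The point values (+1, -1, -2) and the free end gaps are assumptions chosen for illustration; the description does not fix them.

```python
WILDCARD = "$"
MATCH, MISMATCH, GAP = 1, -1, -2  # illustrative point values


def pair_score(template_char, test_char):
    """The wild-card matches anything; otherwise characters must be equal."""
    if template_char == WILDCARD or template_char == test_char:
        return MATCH
    return MISMATCH


def align(template, test):
    """Return aligned (template_char, test_char) pairs; '-' marks a gap."""
    m, n = len(template), len(test)
    # score[i][j]: best score aligning template[:i] with a suffix of test[:j].
    # Row 0 stays at 0, so leading test characters are skipped for free.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * GAP
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(template[i - 1], test[j - 1]),
                score[i - 1][j] + GAP,  # template character left unmatched
                score[i][j - 1] + GAP,  # extra character in the test sequence
            )
    # Trailing test characters are free too: end at the best cell of the last row.
    i, j = m, max(range(n + 1), key=lambda col: score[m][col])
    pairs = []
    while i > 0:
        if j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(
            template[i - 1], test[j - 1]
        ):
            pairs.append((template[i - 1], test[j - 1]))
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + GAP:
            pairs.append((template[i - 1], "-"))
            i -= 1
        else:
            pairs.append(("-", test[j - 1]))
            j -= 1
    return pairs[::-1]


pairs = align("ACCGC$$$TCCCTG$$G", "CGGTATCGCWKRTCCCTGCSGGAT")
print("".join(t for _, t in pairs))  # ATCGCWKRTCCCTGCSG
```

With these point values a single exchanged character scores better than opening two gaps, so the T in the test sequence is paired with the template's C as a mismatch.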

According to a first embodiment, it is supposed that the
below single test sequence has been obtained:

CGGTATCGCWKRTCCCTGCSGGAT.

Aligning the above test sequence and the master template
sequence in the above manner would give the following result:

                          CGGT T               GAT
Test sequence                 A CGCWKRTCCCTGCSG
Master template sequence      A CGC$$$TCCCTG$$G
                               C

According to the invention, all base codes and ambiguity
codes which are aligned with the wild-card codes in the
master template sequence, are then extracted, which gives the
following sequence:

WKRCS.

This extracted sequence of base codes and ambiguity codes
is then compared with all the above combination sequences of
base codes and ambiguity codes.

A match between one of said combination sequences and the
extracted sequence of base codes and ambiguity codes
indicates that that particular combination sequence corresponds
to the two nucleic acid base code sequences to be identified.

In this case, the combination 2/3 above corresponds
exactly with the extracted sequence, which means that the two
nucleic acid base code sequences superposed on each other, in
other words, the two HLA alleles for a certain gene present
in the sequence obtained from a sample from a human
individual, can be identified.

Thus, in the present case, since the subsequences in the 
combination 2/3 are extracted from subtypes 2 and 3, the test 
sequence is, in fact, a superposition of subtypes 2 and 3. 
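
The extraction and comparison steps of this first embodiment can be sketched as a simple lookup. The combination table is copied from the description; the aligned rows are assumed to be already available as equal-length strings, and the function names are hypothetical.

```python
WILDCARD = "$"

# Combination table from the description: superposed codes -> subtype pair.
COMBINATIONS = {
    "TGATC": (1, 1), "WKRYS": (1, 2), "TGAYC": (1, 3),
    "ATGCG": (2, 2), "WKRCS": (2, 3), "TGACC": (3, 3),
}


def extract_wildcard_codes(template_row, test_row):
    """Collect the test codes aligned with wild-card positions."""
    return "".join(t for m, t in zip(template_row, test_row) if m == WILDCARD)


def identify(template_row, test_row):
    """Return the matching subtype pair, or None when nothing matches."""
    return COMBINATIONS.get(extract_wildcard_codes(template_row, test_row))


print(identify("ACCGC$$$TCCCTG$$G", "ATCGCWKRTCCCTGCSG"))  # (2, 3)
```

Note that the mismatch in the conserved region (T against C at the second position) does not affect the result, since only the wild-card positions are compared.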

According to a second embodiment, it is supposed that the 
below two test sequences have been obtained: 

Test sequence I CGGTATCGCWKRTCCCTGCSGGAT 
Test sequence II CGGTACCGTTKRTCCCTGCSGGAT . 

Aligning the above two test sequences and the master
template sequence would give the following results:

                          CGGT T               GAT
Test sequence I               A CGCWKRTCCCTGCSG
Master template sequence      A CGC$$$TCCCTG$$G
                               C

and

                          CGGT    T            GAT
Test sequence II              ACCG TKRTCCCTGCSG
Master template sequence      ACCG $$$TCCCTG$$G
                                  C

As in the first embodiment, all base codes and ambiguity
codes which are aligned with the wild-card codes in the
master template sequence, are then extracted from each test
sequence, which gives the following extracted sequences:

WKRCS, and
TKRCS.

According to the invention, when two or more test sequences
are obtained, a consensus sequence of base codes and
ambiguity codes is then determined from the two or more
extracted sequences in the following way:

For each non-conserved position, a score is assigned to
all possible code types. For the first position, this gives:

Code   1st sequence Score   2nd sequence Score   Total Score

A      0                    0                    0
C      0                    0                    0
G      0                    0                    0
T      0                    0.5-(0.0001*5)       0.4995
R      0                    0                    0
Y      0                    0                    0
W      1.0-(0.0001*5)       0                    0.9995
S      0                    0                    0
M      0                    0                    0
K      0                    0                    0



The code with the highest total score, in this case
W=0.9995, is kept for the consensus sequence. The first
component, 0.5 and 1.0, respectively, reflects the quality of
the local alignment in such a manner that 1.0 means that the
quality of the local alignment is perfect, while 0.5 means
that the quality of the local alignment is not perfect, in
this case, in view of the mismatch immediately to the left of
the position in question. It should be understood that, in
this example, 0.5 has been chosen to reflect a mismatch in an
adjacent position. The second component, 0.0001*5 in both
cases, gives a negative contribution due to the position in
the test sequences in such a manner that a position located
closer to the beginning of the test sequence gives a smaller
negative contribution than a position located further away
from the beginning of the test sequence.
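
The two components can be written compactly. The notation is introduced here for illustration only: $q_k$ is the local alignment quality for test sequence $k$, $p_k$ is the position of the code in that test sequence, and the weight 0.0001 is the value used in this example rather than a prescribed constant.

```latex
s_k(c) = q_k - 0.0001\,p_k,
\qquad
S(c) = \sum_{k \,:\, \text{sequence } k \text{ shows code } c} s_k(c)

S(\mathrm{W}) = 1.0 - 0.0001 \cdot 5 = 0.9995,
\qquad
S(\mathrm{T}) = 0.5 - 0.0001 \cdot 5 = 0.4995
```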

The next position is treated in the same way:

Code   1st sequence Score   2nd sequence Score   Total Score

A      0                    0                    0
C      0                    0                    0
G      0                    0                    0
T      0                    0                    0
R      0                    0                    0
Y      0                    0                    0
W      0                    0                    0
S      0                    0                    0
M      0                    0                    0
K      1.0-(0.0001*6)       1.0-(0.0001*6)       1.9988



The code with the highest total score, in this case
K=1.9988, is kept for the consensus sequence. It should be
pointed out that the total score may be used as a quality
measure of the position in question. Thus, in the above two
examples, the quality of K is almost as high as possible.

Treating the rest of the positions in the same manner
gives the following final consensus sequence:

WKRCS.
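
The consensus determination can be sketched as below. The scores for the first two positions reproduce the tables above. For the remaining three positions the description gives no figures, so a perfect local quality of 1.0 and the positions 7, 14 and 15 (following the template's wild-card positions) are assumed here purely for illustration.

```python
from collections import defaultdict

POSITION_WEIGHT = 0.0001  # per-position penalty, as in the worked example


def consensus(observations):
    """Pick the highest-scoring code per non-conserved position.

    observations holds one list per non-conserved position, each containing
    one (code, local_quality, position) tuple per test sequence.
    """
    result = []
    for calls in observations:
        totals = defaultdict(float)
        for code, quality, position in calls:
            totals[code] += quality - POSITION_WEIGHT * position
        best = max(totals, key=totals.get)
        result.append((best, totals[best]))
    return result


observations = [
    [("W", 1.0, 5), ("T", 0.5, 5)],    # from the first score table
    [("K", 1.0, 6), ("K", 1.0, 6)],    # from the second score table
    [("R", 1.0, 7), ("R", 1.0, 7)],    # assumed values
    [("C", 1.0, 14), ("C", 1.0, 14)],  # assumed values
    [("S", 1.0, 15), ("S", 1.0, 15)],  # assumed values
]
result = consensus(observations)
print("".join(code for code, _ in result))  # WKRCS
```

Keeping the total alongside each code preserves the per-position quality measure mentioned above.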

These determined consensus base codes and ambiguity codes
are then compared with all the above combination sequences of
base codes and ambiguity codes.

As in the first embodiment, a match between one of said
combination sequences and the consensus sequence of base
codes and ambiguity codes indicates that that particular
combination sequence corresponds to the two nucleic acid base
code sequences to be identified.

Also in this second embodiment, the above combination 2/3
corresponds exactly with the consensus sequence, which means
that the two nucleic acid base code sequences superposed on
each other, in other words, the two HLA alleles for a certain
gene present in the sequence obtained from a sample from a
human individual, can be identified.

Thus, also in this case, since the subsequences in the
combination 2/3 are extracted from subtypes 2 and 3, the test
sequence is, in fact, a superposition of subtypes 2 and 3.

It should be understood that the above second embodiment
of the method according to the invention, with two (or more)
test sequences, could also be applied to just a single test
sequence. In that case, the consensus sequence would, of
course, be the same as the sequence extracted from that
single test sequence.

A first embodiment of an apparatus according to the
invention for identifying two nucleic acid base code sequences
belonging to a given set of known base code sequences and
being superposed on each other in an original sequence which
comprises base codes as well as ambiguity codes, comprises
master template sequence constructing means (not shown) for
constructing a master template sequence from said given set
of base code sequences by assigning every conserved position,
where the base code is the same all through the set, that
particular base code in said master template sequence, and
assigning every non-conserved position, where the base code
differs through the set, a wild-card code in said master
template sequence, non-conserved position extracting means
(not shown) for extracting from every base code sequence of
said given set, the non-conserved positions to obtain
non-conserved position subsequences containing only the
non-conserved base codes, superposing means (not shown) for
superposing in pairs all possible combinations of the
non-conserved position sequences extracted by said non-conserved
position extracting means to obtain combination sequences of
base codes and ambiguity codes, original sequence determining
means (not shown) for making a determination of the original
sequence in order to obtain a test sequence, aligning means
(not shown) for aligning said test sequence against said
master template sequence in such a manner that, accepting
gaps in either sequence, the matching between them is
optimized, said wild-card coded non-conserved positions in said
master template sequence being considered as matching any
base code and any ambiguity code in said test sequence, base
code and ambiguity code extracting means (not shown) for
extracting from said test sequence all base codes and
ambiguity codes which are aligned with the wild-card codes in
said master template sequence, and comparing means (not
shown) for comparing the base codes and ambiguity codes
extracted by said base code and ambiguity code extracting
means with all the combination sequences of base codes and
ambiguity codes obtained by means of said superposing means,
a match between one of said combination sequences obtained by
means of said superposing means and the base codes and
ambiguity codes extracted by said base code and ambiguity
code extracting means indicating that that particular
combination sequence of base codes and ambiguity codes
corresponds to said two nucleic acid base code sequences to
be identified.

A second embodiment of an apparatus according to the
invention for identifying two nucleic acid base code sequences
belonging to a given set of known base code sequences and
being superposed on each other in an original sequence which
comprises base codes as well as ambiguity codes, comprises
master template sequence constructing means (not shown) for
constructing a master template sequence from said given set
of base code sequences by assigning every conserved position,
where the base code is the same all through the set, that
particular base code in said master template sequence, and
assigning every non-conserved position, where the base code
differs through the set, a wild-card code in said master
template sequence, non-conserved position extracting means
(not shown) for extracting from every base code sequence of
said given set, the non-conserved positions to obtain
non-conserved position subsequences containing only the
non-conserved base codes, superposing means (not shown) for
superposing, in pairs, all possible combinations of the
non-conserved position sequences extracted by said non-conserved
position extracting means to obtain combination sequences of
base codes and ambiguity codes, original sequence determining
means (not shown) for making one or more determinations of
the original sequence in order to obtain one or more test
sequences, aligning means (not shown) for aligning each of
said one or more test sequences against said master template
sequence in such a manner that, accepting gaps in either
sequence, the matching between the master template and each
test sequence is optimized, said wild-card coded
non-conserved positions in said master template sequence being
considered as matching any base code and any ambiguity code in each
test sequence, base code and ambiguity code extracting means
(not shown) for extracting from each of said test sequences
all base codes and ambiguity codes which are aligned with the
wild-card codes in said master template sequence, determining
means (not shown) for determining, for each non-conserved
position, a consensus base code or ambiguity code on the
basis of the non-conserved bases extracted from each test
sequence by summing up a score for each base code for each
non-conserved position and keeping the base code with the
highest score, the score being a function of the position of
the base code in the respective test sequence as well as of
the local quality of the alignment between the respective
test sequence and said master template sequence, and
comparing means (not shown) for comparing the consensus base
codes and ambiguity codes determined by said determining
means with all the combination sequences of base codes and
ambiguity codes obtained by means of said superposing means,
a match between one of said combination sequences obtained by
means of said superposing means and the consensus base codes
and ambiguity codes determined by said determining means
indicating that that particular combination sequence of base
codes and ambiguity codes corresponds to said two nucleic
acid base code sequences to be identified.

The apparatuses according to the invention are preferably
implemented in computer software.

CLAIMS

1. A method for identifying two nucleic acid base code
sequences belonging to a given set of known base code
sequences and being superposed on each other in an original
sequence which comprises base codes as well as ambiguity codes,
characterized by the steps of

a) constructing a master template sequence from said given
set of base code sequences by assigning every conserved
position, where the base code is the same all through the
set, that particular base code in said master template
sequence, and assigning every non-conserved position, where
the base code differs through the set, a wild-card code in
said master template sequence,

b) extracting from every base code sequence of said given
set, the non-conserved positions to obtain non-conserved
position subsequences containing only the non-conserved base
codes,

c) superposing, in pairs, all possible combinations of the
non-conserved position sequences extracted in step b) to
obtain combination sequences of base codes and ambiguity
codes,

d) making a determination of the original sequence in order
to obtain a test sequence,

e) aligning said test sequence against said master template
sequence in such a manner that, accepting gaps in either
sequence, the matching between them is optimized, said
wild-card coded non-conserved positions in said master template
sequence being considered as matching any base code and any
ambiguity code in said test sequence,

f) extracting from said test sequence all base codes and
ambiguity codes which are aligned with the wild-card codes in
said master template sequence, and

g) comparing the base codes and ambiguity codes extracted in
step f) with all the combination sequences of base codes and
ambiguity codes obtained in step c), a match between one of
said combination sequences obtained in step c) and the base
codes and ambiguity codes extracted in step f) indicating
that that particular combination sequence of base codes and
ambiguity codes corresponds to said two nucleic acid base
code sequences to be identified.

2. A method for identifying two nucleic acid base code 
sequences belonging to a given set of known base code sequen- 
ces and being superposed on each other in an original sequen- 
ce which comprises base codes as well as ambiguity codes, 
10 characterized by the steps of 

a) constructing a master template sequence from said given
set of base code sequences by assigning every conserved
position, where the base code is the same all through the
set, that particular base code in said master template
sequence, and assigning every non-conserved position, where
the base code differs through the set, a wild-card code in
said master template sequence,

b) extracting from every base code sequence of said given
set, the non-conserved positions to obtain non-conserved
position subsequences containing only the non-conserved base
codes,

c) superposing, in pairs, all possible combinations of the
non-conserved position sequences extracted in step b) to
obtain combination sequences of base codes and ambiguity
codes,

d) making one or more determinations of the original sequence 
in order to obtain one or more test sequences, 

e) aligning each of said one or more test sequences against
said master template sequence in such a manner that, accepting
gaps in either sequence, the matching between the master
template and each test sequence is optimized, said wild-card
coded non-conserved positions in said master template sequence
being considered as matching any base code and any ambiguity
code in each test sequence,

f) extracting from each of said test sequences all base codes
and ambiguity codes which are aligned with the wild-card
codes in said master template sequence,

g) determining, for each non-conserved position, a consensus
base code or ambiguity code on the basis of the non-conserved
bases extracted from each test sequence by summing up a score
for each base code for each non-conserved position and
keeping the base code with the highest score, the score being
a function of the position of the base code in the respective
test sequence as well as of the local quality of the alignment
between the respective test sequence and said master
template sequence, and

h) comparing the consensus base codes and ambiguity codes
determined in step g) with all the combination sequences of
base codes and ambiguity codes obtained in step c), a match
between one of said combination sequences obtained in step c)
and the consensus base codes and ambiguity codes determined
in step g), indicating that that particular combination
sequence of base codes and ambiguity codes corresponds to
said two nucleic acid base code sequences to be identified.
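
The consensus determination of step g) might, purely as an illustration, be sketched as below. The claim states only that the score depends on the position of the base code in its test sequence and on the local alignment quality; the particular weighting in `score`, the tuple layout of an observation and the use of `N` for a position with no observations are all assumptions made for this example.

```python
from collections import defaultdict


def score(pos, read_len, local_quality):
    """Hypothetical weighting: position factor times local alignment quality.

    The claim does not specify this function; any function of the same two
    inputs could be substituted.
    """
    return (1.0 - pos / read_len) * local_quality


def consensus(reads):
    """Sum the scores per code at each non-conserved position, keep the best.

    Each read is a list with one entry per non-conserved position: either
    None (not covered) or a tuple (code, pos, read_len, local_quality).
    """
    totals = [defaultdict(float) for _ in range(len(reads[0]))]
    for read in reads:
        for i, obs in enumerate(read):
            if obs is None:
                continue
            code, pos, read_len, quality = obs
            totals[i][code] += score(pos, read_len, quality)
    # "N" marks a position that no test sequence covered (assumption).
    return "".join(max(t, key=t.get) if t else "N" for t in totals)


# Three invented test sequences covering two non-conserved positions.
reads = [
    [("Y", 10, 100, 1.0), ("R", 50, 100, 0.5)],
    [("Y", 20, 100, 0.8), ("A", 60, 100, 1.0)],
    [("C", 90, 100, 1.0), ("R", 30, 100, 0.5)],
]
result = consensus(reads)
```

The resulting string takes the place of the single extracted string of claim 1 when compared with the combination sequences in step h).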

3. A method of genetic analysis, comprising the steps of 

(i) subjecting a test sample to a sequencing procedure to 
obtain two superposed base code sequences representing the 
alleles present for a specific gene, and 

(ii) identifying the base code sequences by the method 
according to claim 1 or 2. 

4. Use of the method according to claim 1 or 2 for HLA 
typing. 

5. An apparatus for identifying two nucleic acid base code
sequences belonging to a given set of known base code
sequences and being superposed on each other in an original
sequence which comprises base codes as well as ambiguity codes,
characterized in that it comprises

- master template sequence constructing means for constructing
a master template sequence from said given set of base
code sequences by assigning every conserved position, where
the base code is the same all through the set, that particular
base code in said master template sequence, and assigning
every non-conserved position, where the base code differs
through the set, a wild-card code in said master template
sequence,

- non-conserved position extracting means for extracting from
every base code sequence of said given set, the non-conserved
positions to obtain non-conserved position subsequences
containing only the non-conserved base codes,

- superposing means for superposing, in pairs, all possible
combinations of the non-conserved position sequences extracted
by said non-conserved position extracting means to obtain
combination sequences of base codes and ambiguity codes,

- original sequence determining means for making a
determination of the original sequence in order to obtain a test
sequence,

- aligning means for aligning said test sequence against said
master template sequence in such a manner that, accepting
gaps in either sequence, the matching between them is
optimized, said wild-card coded non-conserved positions in said
master template sequence being considered as matching any
base code and any ambiguity code in said test sequence,

- base code and ambiguity code extracting means for extracting
from said test sequence all base codes and ambiguity
codes which are aligned with the wild-card codes in said
master template sequence, and

- comparing means for comparing the base codes and ambiguity
codes extracted by said base code and ambiguity code extracting
means with all the combination sequences of base codes
and ambiguity codes obtained by means of said superposing
means, a match between one of said combination sequences
obtained by means of said superposing means and the base
codes and ambiguity codes extracted by said base code and
ambiguity code extracting means, indicating that that
particular combination sequence of base codes and ambiguity codes
corresponds to said two nucleic acid base code sequences to
be identified.
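
The aligning means and the extracting means recited above could, as one possibility, be realized along the following lines. The claims do not name an alignment algorithm; a Needleman-Wunsch style global alignment is substituted here as an assumption, with invented scoring values (match +1, mismatch -1, gap -2), `*` as the assumed wild-card symbol and `-` marking a wild-card position that has no counterpart in the test sequence.

```python
WILDCARD = "*"  # assumed wild-card symbol; the claims do not name one


def align_and_extract(template, test, match=1, mismatch=-1, gap=-2):
    """Globally align test to template and return the test codes that fall
    on wild-card positions. Scoring values are illustrative assumptions."""
    n, m = len(template), len(test)

    def sub(t, c):
        # A wild-card matches any base code or ambiguity code.
        return match if t == WILDCARD or t == c else mismatch

    # Fill the dynamic-programming score matrix.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = i * gap
    for j in range(1, m + 1):
        H[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                H[i - 1][j - 1] + sub(template[i - 1], test[j - 1]),
                H[i - 1][j] + gap,  # gap in the test sequence
                H[i][j - 1] + gap,  # gap in the template (inserted code)
            )

    # Trace back one optimal path, collecting codes aligned with wild-cards.
    out = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and H[i][j] == H[i - 1][j - 1] + sub(template[i - 1], test[j - 1])):
            if template[i - 1] == WILDCARD:
                out.append(test[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and H[i][j] == H[i - 1][j] + gap:
            if template[i - 1] == WILDCARD:
                out.append("-")  # wild-card position absent from the test
            i -= 1
        else:
            j -= 1  # inserted code in the test sequence is skipped
    return "".join(reversed(out))


clean = align_and_extract("AC*GT*CA", "ACYGTRCA")
with_insertion = align_and_extract("AC*GT*CA", "ACYGATRCA")
with_deletion = align_and_extract("AC*GT*CA", "ACYGRCA")
```

In these invented inputs the inserted or removed base lies in a conserved stretch, so the same two codes are extracted in all three cases. When several alignments score equally, the traceback returns only one of them, which a practical implementation would need to handle.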

6. An apparatus for identifying two nucleic acid base code
sequences belonging to a given set of known base code
sequences and being superposed on each other in an original
sequence which comprises base codes as well as ambiguity codes,
characterized in that it comprises

- master template sequence constructing means for constructing
a master template sequence from said given set of base
code sequences by assigning every conserved position, where
the base code is the same all through the set, that particular
base code in said master template sequence, and assigning
every non-conserved position, where the base code differs
through the set, a wild-card code in said master template
sequence,

- non-conserved position extracting means for extracting from
every base code sequence of said given set, the non-conserved
positions to obtain non-conserved position subsequences
containing only the non-conserved base codes,

- superposing means for superposing, in pairs, all possible
combinations of the non-conserved position sequences extracted
by said non-conserved position extracting means to obtain
combination sequences of base codes and ambiguity codes,

- original sequence determining means for making one or more 
determinations of the original sequence in order to obtain 
one or more test sequences, 

- aligning means for aligning each of said one or more test
sequences against said master template sequence in such a
manner that, accepting gaps in either sequence, the matching
between the master template and each test sequence is
optimized, said wild-card coded non-conserved positions in said
master template sequence being considered as matching any
base code and any ambiguity code in each test sequence,

- base code and ambiguity code extracting means for extracting
from each of said test sequences all base codes and
ambiguity codes which are aligned with the wild-card codes in
said master template sequence,

- determining means for determining, for each non-conserved
position, a consensus base code or ambiguity code on the
basis of the non-conserved bases extracted from each test
sequence by summing up a score for each base code for each
non-conserved position and keeping the base code with the
highest score, the score being a function of the position of
the base code in the respective test sequence as well as of
the local quality of the alignment between the respective
test sequence and said master template sequence, and

- comparing means for comparing the consensus base codes and
ambiguity codes determined by said determining means with all
the combination sequences of base codes and ambiguity codes
obtained by means of said superposing means, a match between
one of said combination sequences obtained by means of said
superposing means and the consensus base codes and ambiguity
codes determined by said determining means, indicating that
that particular combination sequence of base codes and
ambiguity codes corresponds to said two nucleic acid base
code sequences to be identified.